Kubevirt: Defer eve reboot/shutdown/update until drain completes #4494

Open · wants to merge 2 commits into master

Conversation

@andrewd-zededa (Contributor) commented Dec 23, 2024

As part of kubevirt-eve we have multiple cluster nodes, each hosting app workloads and volume replicas.
This implements deferral of EVE management config operations that would result in unavailability of
storage replicas, for as long as a cluster volume is running on only a single replica.

An example:

  1. Node 1 has an outage and recovers.
  2. Before volumes complete rebuilding on node 1, there is a node 2 outage and recovery.
  3. Volumes begin rebuilding replicas on nodes 1 and 2. The only available rebuild source is on node 3.
  4. A user-initiated request arrives to reboot/shutdown/update eve-os on node 3.
  5. That config request is deferred until replicas are rebuilt on the other nodes.

At a high level the eve-side workflow looks like this:

  1. EVE config is received requesting reboot/shutdown/baseos-image-change for node 1.
  2. A drain is requested for node 1.
  3. zedkube cordons node 1 so that new workloads are blocked from scheduling on that node.
  4. zedkube initiates a kubernetes drain of that node, removing workloads (see the sketch after the note below).
  5. As part of the drain, the PDB (Pod Disruption Budget) at the longhorn level determines whether the local replica is the last one online.
  6. The drain waits for volume replicas to rebuild across the cluster.
  7. The drain completes and a NodeDrainStatus message is sent to continue the original config request.
  8. On the next boot event, zedkube's nodeOnBootHealthStatusWatcher() waits until the local kubernetes node comes online/ready for the first time and then uncordons it, allowing workloads to be scheduled.

Note: For eve baseos image updates this path waits until a new baseos image is fully available locally (LOADED or INSTALLED) and activated before beginning drain.
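
For illustration, here is a minimal sketch of steps 3 and 4 (cordon, then drain) using the k8s.io/kubectl/pkg/drain package vendored by this PR. The function name, Helper settings, and timeout below are assumptions for the sketch, not the actual zedkube implementation.

```go
package sketch

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// cordonAndDrain is an illustrative stand-in for zedkube's drain path.
func cordonAndDrain(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	helper := &drain.Helper{
		Ctx:                 ctx,
		Client:              client,
		Force:               true, // also evict pods not managed by a controller
		IgnoreAllDaemonSets: true, // DaemonSet pods cannot be evicted
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1,               // use each pod's own grace period
		Timeout:             30 * time.Minute, // assumed value, not from the PR
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}

	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("get node %s: %w", nodeName, err)
	}

	// Step 3: cordon, blocking new workloads from scheduling on the node.
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		return fmt.Errorf("cordon %s: %w", nodeName, err)
	}

	// Step 4: drain, evicting workloads; longhorn's PDB blocks eviction of the
	// last online replica until rebuilds complete elsewhere (steps 5 and 6).
	if err := drain.RunNodeDrain(helper, nodeName); err != nil {
		return fmt.Errorf("drain %s: %w", nodeName, err)
	}
	return nil
}
```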

This PR implements the original API added in #4366.

@andrewd-zededa (Contributor Author) commented:

  ./out has been created
  Modes: GitHubActions Robot InContainer ResetRepo UnitTests
  Processing: GH:4494
  GITHUB PR #4494 is being downloaded from
  https://api.github.com/repos/lf-edge/eve/pulls/4494
    JSON data at Mon Dec 23 11:09:31 PM UTC 2024
    Patch data at Mon Dec 23 11:09:32 PM UTC 2024
  ERROR: Unsure how to process GH:4494. Permissions missing?

An odd yetus failure, not sure what to make of this.

@andrewd-zededa (Contributor Author) commented:

yetus checking seems broken:

2025-01-18T02:42:21.7383936Z ./out has been created
2025-01-18T02:42:22.0153173Z Modes: GitHubActions Robot InContainer ResetRepo UnitTests
2025-01-18T02:42:22.0153688Z Processing: GH:4494
2025-01-18T02:42:22.0158046Z GITHUB PR #4494 is being downloaded from
2025-01-18T02:42:22.0158544Z https://api.github.com/repos/lf-edge/eve/pulls/4494
2025-01-18T02:42:22.0172328Z JSON data at Sat Jan 18 02:42:22 AM UTC 2025
2025-01-18T02:42:22.5937865Z Patch data at Sat Jan 18 02:42:22 AM UTC 2025
2025-01-18T02:42:23.0465375Z ERROR: Unsure how to process GH:4494. Permissions missing?

@andrewd-zededa marked this pull request as ready for review January 20, 2025 16:15
@@ -77,6 +77,7 @@ type getconfigContext struct {
	configGetStatus  types.ConfigGetStatus
	updateInprogress bool
	readSavedConfig  bool // Did we already read it?
	drainInProgress  bool
Contributor:

The code reads as if this field has the semantics of waitForDrainComplete (or deferForDrain). Is that the semantics in zedagent?

@andrewd-zededa (Contributor Author) commented Jan 21, 2025:

This follows the above updateInprogress usage to allow drain handling / deferred ops in zedagent.go:handleNodeAgentStatusImpl()

Contributor:

I found the name potentially confusing since this is really about waiting and not starting/triggering a drain.

Contributor Author:

I'm following now, I can reword it

	if ctx.poweroffCmdDeferred &&
		drainInProgress && !status.DrainInProgress {
		log.Functionf("Drain complete and deferred poweroff")
		ctx.poweroffCmdDeferred = false
Contributor:

I don't see the code which sets poweroffCmdDeferred due to a drain.
Did you miss a function which sets the various deferred fields due to the local profile server API being used? (I don't recall whether that is a separate function from the controller API.)

Contributor:

FWIW that code is in pkg/pillar/cmd/zedagent/localinfo.go and presumably it needs the drainInProgress checks for poweroff, shutdown, and reboot.
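
For illustration, a minimal, self-contained sketch of the kind of guard being discussed for the local profile server path. Only the drainInProgress and poweroffCmdDeferred names come from this PR; the context type and the poweroff callback are stand-ins, not the actual localinfo.go code.

```go
package sketch

// localCmdContext is an illustrative stand-in for the zedagent context.
type localCmdContext struct {
	drainInProgress     bool
	poweroffCmdDeferred bool
}

// handleLocalPoweroffCmd defers a locally requested poweroff while a node
// drain is still in progress.
func handleLocalPoweroffCmd(ctx *localCmdContext, poweroff func()) {
	if ctx.drainInProgress {
		ctx.poweroffCmdDeferred = true // resume once NodeDrainStatus reports completion
		return
	}
	poweroff()
}

// onDrainComplete runs the deferred command, matching the poweroffCmdDeferred
// check quoted from the diff above.
func onDrainComplete(ctx *localCmdContext, poweroff func()) {
	if ctx.poweroffCmdDeferred {
		ctx.poweroffCmdDeferred = false
		poweroff()
	}
}
```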

	}

	log.Noticef("handleNodeDrainStatusImplNA to:%v", newStatus)
	// NodeDrainStatus Failures here should keep drainInProgress set.
Contributor:

If there is something catastrophic happening to a cluster (e.g., two out of three nodes die due to hardware failures), is there a way to recover from that and make the single remaining node usable, e.g., as a one-node cluster or as a standalone device?

It seems like a reboot might not be usable to recover since (based on this comment) you can't reboot until the drain has succeeded, and that is presumably impossible if the two other nodes died while you were trying to drain.
What happens when the drain timer fires?

Is there a recovery case which does not involve a truck roll and a reinstall of EVE in such a case?

Contributor Author:

If only a single cluster node is usable, then etcd will be unable to form a quorum and k3s won't be able to reach the ready state.

If the cluster is in a state where the local kubernetes node is not running, then applications should already be unavailable and the zedkube drain handler should just skip requesting drain. There is already a handler that skips drain for one kubernetes outage type and returns complete, but I do see how this needs to be expanded to handle more kubernetes outage cases and to make it clearer that zedkube allows these controller operations as recovery mechanisms, not just maintenance.

The key here is to make sure we appropriately debounce the intermittent kubernetes API issues we could see due to timeouts. This PR has a constant value of 180 seconds, but I'm going to move it to a config property to allow some modification.
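
A minimal sketch of the debounce idea described above, using a 180-second window to match the constant mentioned; the type and names are illustrative, not the PR's code.

```go
package sketch

import "time"

// kubeAPIFailureWindow mirrors the 180-second constant mentioned above; the
// intent is to move this to a config property.
const kubeAPIFailureWindow = 180 * time.Second

// apiFailureDebouncer reports a kubernetes API outage only after errors have
// persisted for the whole window, so short timeouts do not fail a drain.
type apiFailureDebouncer struct {
	firstFailure time.Time
}

// Failed returns true only once errors have been continuous for the window.
func (d *apiFailureDebouncer) Failed(err error, now time.Time) bool {
	if err == nil {
		d.firstFailure = time.Time{} // healthy again, reset the window
		return false
	}
	if d.firstFailure.IsZero() {
		d.firstFailure = now // first error seen, start the window
	}
	return now.Sub(d.firstFailure) >= kubeAPIFailureWindow
}
```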

Contributor:

I was thinking of a different sequence.

  1. Cluster is running with three nodes.
  2. Controller (or LOC) requests a device reboot or EVE update of node X
  3. Drain starts
  4. Node Y fails/disappears from the cluster for any reason (hardware, software crash, network partition)

In such a case, should we proceed with the device reboot/update of node X? Or pause until Y is back in the cluster?

@eriknordmark (Contributor) left a comment:

In addition to the detailed comments, I think the text in the PR description should go in a markdown file.

Also, at a high level, what are the pre-conditions for the reboot/shutdown/update?
Is it that all three nodes are healthy?
(That would ensure that if there is a hardware failure or crash of another node while this one is rebooting/updating that there is still one node left.)

If while the drain is in progress one of the other nodes in the cluster dies, should the reboot/update be deferred until that node is back?

	if override != nil {
		zedkubeCtx.pubNodeDrainStatus.Publish("global", override)
	}
	zedkubeCtx.drainOverrideTimer.Reset(5 * time.Minute)
Contributor:

Shouldn't you Stop the timer here as well like in the initial case?
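
A hedged sketch of the suggestion, using the standard time.Timer stop-and-drain idiom before Reset; the wrapper function is illustrative and the timer passed in stands in for zedkubeCtx.drainOverrideTimer.

```go
package sketch

import "time"

// resetOverrideTimer stops (and drains) the timer before resetting it, as in
// the initial case; the timer passed in stands in for drainOverrideTimer.
func resetOverrideTimer(t *time.Timer) {
	if !t.Stop() {
		select {
		case <-t.C: // discard a pending fire so Reset starts from a clean state
		default:
		}
	}
	t.Reset(5 * time.Minute)
}
```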

		// Just print the transitions which are linked to lengthy operations or errors
		return
	}
	ctx.ph.Print("INFO: Node Drain -> %s at %v\n", newStatus.Status.String(), ts)
Contributor:

Should this also go in the new console TUI? Check with @rucoder

@andrewd-zededa (Contributor Author) commented:

Hi @eriknordmark, thanks for the review. I'll address the docs changes and cleanup.

Main dependencies listed here, others pulled in for version conflicts:
k8s.io/kubectl/pkg/drain
github.com/longhorn/longhorn-manager

Signed-off-by: Andrew Durbin <[email protected]>